{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/DS4SD/docling/blob/main/docs/examples/pictures_description.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q docling[vlm] ipython"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from docling.datamodel.base_models import InputFormat\n",
    "from docling.datamodel.pipeline_options import PdfPipelineOptions\n",
    "from docling.document_converter import DocumentConverter, PdfFormatOption"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The source document\n",
    "DOC_SOURCE = \"https://arxiv.org/pdf/2501.17887\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Describe pictures with Granite Vision\n",
    "\n",
    "This section will run locally the [ibm-granite/granite-vision-3.1-2b-preview](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview) model to describe the pictures of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "93a634699bf1434c9bc8e384d6db1a28",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from docling.datamodel.pipeline_options import granite_picture_description\n",
    "\n",
    "pipeline_options = PdfPipelineOptions()\n",
    "pipeline_options.do_picture_description = True\n",
    "pipeline_options.picture_description_options = (\n",
    "    granite_picture_description  # <-- the model choice\n",
    ")\n",
    "pipeline_options.picture_description_options.prompt = (\n",
    "    \"Describe the image in three sentences. Be consise and accurate.\"\n",
    ")\n",
    "pipeline_options.images_scale = 2.0\n",
    "pipeline_options.generate_picture_images = True\n",
    "\n",
    "converter = DocumentConverter(\n",
    "    format_options={\n",
    "        InputFormat.PDF: PdfFormatOption(\n",
    "            pipeline_options=pipeline_options,\n",
    "        )\n",
    "    }\n",
    ")\n",
    "doc = converter.convert(DOC_SOURCE).document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Picture <code>#/pictures/0</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.<br /><h4>Annotations (ibm-granite/granite-vision-3.1-2b-preview)</h4>In this image we can see a poster with some text and images.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/1</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 2: Dataset categories and sample counts for documents and pages.<br /><h4>Annotations (ibm-granite/granite-vision-3.1-2b-preview)</h4>In this image we can see a pie chart. In the pie chart we can see the categories and the number of documents in each category.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/2</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.<br /><h4>Annotations (ibm-granite/granite-vision-3.1-2b-preview)</h4>In this image we can see a graph. On the x-axis we can see the number of pages. On the y-axis we can see the seconds.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/3</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .<br /><h4>Annotations (ibm-granite/granite-vision-3.1-2b-preview)</h4>In this image we can see a bar chart and a line chart. In the bar chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page. In the line chart we can see the values of Pdf Parse, OCR, Layout, Table Structure, Page Total and Page.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/4</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).<br /><h4>Annotations (ibm-granite/granite-vision-3.1-2b-preview)</h4>In this image we can see a bar chart. In the chart we can see the CPU, Max, GPU, and sec/page.<br />\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from docling_core.types.doc.document import PictureDescriptionData\n",
    "from IPython import display\n",
    "\n",
    "html_buffer = []\n",
    "# display the first 5 pictures and their captions and annotations:\n",
    "for pic in doc.pictures[:5]:\n",
    "    html_item = (\n",
    "        f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n",
    "        f'<img src=\"{pic.image.uri!s}\" /><br />'\n",
    "        f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n",
    "    )\n",
    "    for annotation in pic.annotations:\n",
    "        if not isinstance(annotation, PictureDescriptionData):\n",
    "            continue\n",
    "        html_item += (\n",
    "            f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n",
    "        )\n",
    "    html_buffer.append(html_item)\n",
    "display.HTML(\"<hr />\".join(html_buffer))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Describe pictures with SmolVLM\n",
    "\n",
    "This section will run locally the [HuggingFaceTB/SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct) model to describe the pictures of the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from docling.datamodel.pipeline_options import smolvlm_picture_description\n",
    "\n",
    "pipeline_options = PdfPipelineOptions()\n",
    "pipeline_options.do_picture_description = True\n",
    "pipeline_options.picture_description_options = (\n",
    "    smolvlm_picture_description  # <-- the model choice\n",
    ")\n",
    "pipeline_options.picture_description_options.prompt = (\n",
    "    \"Describe the image in three sentences. Be consise and accurate.\"\n",
    ")\n",
    "pipeline_options.images_scale = 2.0\n",
    "pipeline_options.generate_picture_images = True\n",
    "\n",
    "converter = DocumentConverter(\n",
    "    format_options={\n",
    "        InputFormat.PDF: PdfFormatOption(\n",
    "            pipeline_options=pipeline_options,\n",
    "        )\n",
    "    }\n",
    ")\n",
    "doc = converter.convert(DOC_SOURCE).document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<h3>Picture <code>#/pictures/0</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 1: Sketch of Docling's pipelines and usage model. Both PDF pipeline and simple pipeline build up a DoclingDocument representation, which can be further enriched. Downstream applications can utilize Docling's API to inspect, export, or chunk the document for various purposes.<br /><h4>Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)</h4>This is a page that has different types of documents on it.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/1</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 2: Dataset categories and sample counts for documents and pages.<br /><h4>Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)</h4>Here is a page-by-page list of documents per category:\n",
       "- Science\n",
       "- Articles\n",
       "- Law and Regulations\n",
       "- Articles\n",
       "- Misc.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/2</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 3: Distribution of conversion times for all documents, ordered by number of pages in a document, on all system configurations. Every dot represents one document. Log/log scale is used to even the spacing, since both number of pages and conversion times have long-tail distributions.<br /><h4>Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)</h4>The image is a bar chart that shows the number of pages of a website as a function of the number of pages of the website. The x-axis represents the number of pages, ranging from 100 to 10,000. The y-axis represents the number of pages, ranging from 100 to 10,000. The chart is labeled \"Number of pages\" and has a legend at the top of the chart that indicates the number of pages.\n",
       "\n",
       "The chart shows a clear trend: as the number of pages increases, the number of pages decreases. This is evident from the following points:\n",
       "\n",
       "- The number of pages increases from 100 to 1000.\n",
       "- The number of pages decreases from 1000 to 10,000.\n",
       "- The number of pages increases from 10,000 to 10,000.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/3</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 4: Contributions of PDF backend and AI models to the conversion time of a page (in seconds per page). Lower is better. Left: Ranges of time contributions for each model to pages it was applied on (i.e., OCR was applied only on pages with bitmaps, table structure was applied only on pages with tables). Right: Average time contribution to a page in the benchmark dataset (factoring in zero-time contribution for OCR and table structure models on pages without bitmaps or tables) .<br /><h4>Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)</h4>bar chart with different colored bars representing different data points.<br />\n",
       "<hr /><h3>Picture <code>#/pictures/4</code></h3><img src=\"\" /><br /><h4>Caption</h4>Figure 5: Conversion time in seconds per page on our dataset in three scenarios, across all assets and system configurations. Lower bars are better. The configuration includes OCR and table structure recognition ( fast table option on Docling and MinerU, hi res in unstructured, as shown in table 1).<br /><h4>Annotations (HuggingFaceTB/SmolVLM-256M-Instruct)</h4>A bar chart with the following information:\n",
       "\n",
       "- The x-axis represents the number of pages, ranging from 0 to 14.\n",
       "- The y-axis represents the page count, ranging from 0 to 14.\n",
       "- The chart has three categories: Marker, Unstructured, and Detailed.\n",
       "- The x-axis is labeled \"see/page.\"\n",
       "- The y-axis is labeled \"Page Count.\"\n",
       "- The chart shows that the Marker category has the highest number of pages, followed by the Unstructured category, and then the Detailed category.<br />\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from docling_core.types.doc.document import PictureDescriptionData\n",
    "from IPython import display\n",
    "\n",
    "html_buffer = []\n",
    "# display the first 5 pictures and their captions and annotations:\n",
    "for pic in doc.pictures[:5]:\n",
    "    html_item = (\n",
    "        f\"<h3>Picture <code>{pic.self_ref}</code></h3>\"\n",
    "        f'<img src=\"{pic.image.uri!s}\" /><br />'\n",
    "        f\"<h4>Caption</h4>{pic.caption_text(doc=doc)}<br />\"\n",
    "    )\n",
    "    for annotation in pic.annotations:\n",
    "        if not isinstance(annotation, PictureDescriptionData):\n",
    "            continue\n",
    "        html_item += (\n",
    "            f\"<h4>Annotations ({annotation.provenance})</h4>{annotation.text}<br />\\n\"\n",
    "        )\n",
    "    html_buffer.append(html_item)\n",
    "display.HTML(\"<hr />\".join(html_buffer))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use other vision models\n",
    "\n",
    "The examples above can also be reproduced using other vision model.\n",
    "The Docling options `PictureDescriptionVlmOptions` allows to specify your favorite vision model from the Hugging Face Hub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from docling.datamodel.pipeline_options import PictureDescriptionVlmOptions\n",
    "\n",
    "pipeline_options = PdfPipelineOptions()\n",
    "pipeline_options.do_picture_description = True\n",
    "pipeline_options.picture_description_options = PictureDescriptionVlmOptions(\n",
    "    repo_id=\"\",  # <-- add here the Hugging Face repo_id of your favorite VLM\n",
    "    prompt=\"Describe the image in three sentences. Be consise and accurate.\",\n",
    ")\n",
    "pipeline_options.images_scale = 2.0\n",
    "pipeline_options.generate_picture_images = True\n",
    "\n",
    "converter = DocumentConverter(\n",
    "    format_options={\n",
    "        InputFormat.PDF: PdfFormatOption(\n",
    "            pipeline_options=pipeline_options,\n",
    "        )\n",
    "    }\n",
    ")\n",
    "\n",
    "# Uncomment to run:\n",
    "# doc = converter.convert(DOC_SOURCE).document"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "docling-hgXEfXco-py3.12",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
